Abstract

We propose to use MapReduce to quickly test new retrieval approaches 
on a cluster of machines by sequentially scanning all documents. We 
present a small case study in which we use a cluster of 15 low-cost machines
to search a web crawl of 0.5 billion pages, showing that sequential
scanning is a viable approach to running large-scale information retrieval 
experiments with little effort. The code is available to other researchers 
at: http://mirex.sourceforge.net

1 Introduction 

A lot of research in the field of information retrieval aims at improving the 
quality of search results. Search quality might for instance be improved by 
new scoring functions, new indexing approaches, new query (re-)formulation 
approaches, etc. To make a scientific judgment of the quality of a new search
approach, it is good practice to use so-called benchmark test collections, such as 
those provided by TREC [10]. The following steps typically need to be taken: 

1. The researcher codes the new approach by adapting an experimental 
search system, such as Lemur [5], PF/Tijah [8], or Terrier [9]; 

2. The researcher uses the system to create an inverted index on the docu- 
ments from the test collection; 

3. The researcher puts the queries to the experimental search engine and 
gathers the top X search results (a common value for TREC experiments 
is X = 1000); 

4. The researcher compares the top X to a gold standard by computing
standard evaluation measures such as mean average precision. 
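
As an illustration of Step 4, the following minimal sketch computes mean average precision for a set of ranked lists. The function names and the toy run and judgment data are hypothetical; actual TREC experiments normally rely on the official evaluation tools and relevance judgments.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average precision of one ranked list against a set of relevant documents."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank of each relevant document retrieved.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs, judgments):
    """Mean of the per-query average precision over all judged queries."""
    scores = [average_precision(runs.get(query_id, []), relevant)
              for query_id, relevant in judgments.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: two queries with four retrieved documents each.
runs = {"q1": ["d3", "d1", "d7", "d2"], "q2": ["d5", "d9", "d4", "d8"]}
judgments = {"q1": {"d1", "d2"}, "q2": {"d5"}}
print(mean_average_precision(runs, judgments))
```
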

In our experience, Step 1, actually coding the new approach, takes by far the
most effort and time when conducting an information retrieval experiment. Cod- 
ing new retrieval approaches into existing search engines like Lemur, PF/Tijah 
and Terrier is a tedious job, even if the code is maintained by members of the 
same research team. It requires detailed knowledge of the existing code of the 
search engine, or at least, knowledge of the part of the code that needs to be 
adapted. Radical new approaches to information retrieval, i.e., approaches that 
need information that is not available from the search engine's inverted index,
require reimplementing part of the indexing functionality. Such radical new
approaches are therefore not often evaluated, and most research is done by small
changes to the system. 

In his WSDM keynote lecture, Dean [3] describes how MapReduce [4] is used
at Google for experimental evaluations. New ranking ideas are tested off-line on 
human rated query sets similar to the queries from TREC. Running such off- 
line tests has to be easy for the researchers at Google, possibly at the expense of 
the efficiency of the prototype. So, it is okay if it takes hours to run for instance 
10,000 queries, as long as the experimental infrastructure allows for fast and 
easy coding of new approaches. A similar experimental setup was followed by 
Microsoft at TREC 2009: Craswell et al. [2] use DryadLINQ [12] on a cluster
of 240 machines to run web search experiments. Their setup also sequentially 
scans all document representations, providing a flexible environment for a wide 
range of experiments. The researchers plan to run many more experiments to
discover the benefits and limitations of this setup.

The work at Google and Microsoft shows that sequential scanning over large 
document collections is a viable approach to experimental information retrieval. 
Some of the advantages are: 

1. Researchers spend less time on coding and debugging new experimental 
retrieval approaches; 

2. It is easy to include new information in the ranking algorithm, even if
that information would not normally be included in the search engine's
inverted index;

3. Researchers are able to oversee all or most of the code used in the exper- 
iment; 

4. Large-scale experiments can be done in reasonable time. 

We show that indeed sequential scanning is a viable experimental tool, even if 
only a few machines are available. In Section 2 we describe the MapReduce 
search system. Sections 3 and 4 contain experimental results and concluding 
remarks. 

2 Sequential Search in MapReduce 

MapReduce is a framework for batch processing of large data sets on clusters of 
commodity machines [4]. Users of the framework specify a mapper function that
processes a key/value pair to generate a set of intermediate key/value pairs, and
a reducer function that processes intermediate values associated with the same 
intermediate key. The pseudo code in Figure 1 outlines our sequential search
implementation. The implementation does a single scan of the documents, pro- 
cessing all queries in parallel. 



mapper (DocId, DocText) =
  FOREACH (QueryId, QueryText) IN Queries
    Score = experimental_score (QueryText, DocText)
    IF (Score > 0)
      THEN OUTPUT (QueryId, (DocId, Score))

reducer (QueryId, DocIdScorePairs) =
  RankedList = ARRAY [1000]
  FOREACH (DocId, Score) IN DocIdScorePairs
    IF (NOT filled (RankedList) OR
        Score > smallest_score (RankedList))
      THEN ranked_insert (RankedList, (DocId, Score))
  FOREACH (DocId, Score) IN RankedList
    OUTPUT (QueryId, DocId, Score)



Figure 1: Pseudo code for linear search 

The mapper function takes pairs of document identifier and document text 
(DocId, DocText). For each pair, it runs all benchmark queries and outputs 
for each matching query the query identifier as key, and the pair document 
identifier and score as value. In the code, Queries is a global constant per
experiment. The MapReduce framework runs the mappers in parallel on each
machine in the cluster. When the map step finishes, the framework groups
the intermediate output per key, i.e., per QueryId. The reducer function then
simply takes the top 1000 results for each query identifier, and outputs those
as the final result. The reducer function is also applied locally on each machine
(that is, the reducer is also used as a combiner [4]), making sure that at most
1000 results per query have to be sent from each machine after the map phase
finishes.
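
The data flow of Figure 1 can be tried out on a single machine with the following minimal sketch, which simulates the map, combine, shuffle, and reduce steps in memory. It is not the Hadoop implementation: the term-counting `experimental_score`, the `run_job` driver, and the toy queries and documents are assumptions made for illustration only.

```python
import heapq
from collections import defaultdict

TOP_K = 1000  # number of results kept per query, as in Figure 1

# Hypothetical benchmark queries; a global constant per experiment.
QUERIES = {"q1": "map reduce", "q2": "anchor text"}


def experimental_score(query_text, doc_text):
    """Placeholder scoring function: counts occurrences of the query terms."""
    doc_terms = doc_text.lower().split()
    return sum(doc_terms.count(term) for term in query_text.lower().split())


def mapper(doc_id, doc_text):
    """Score one document against every query and emit only the matches."""
    for query_id, query_text in QUERIES.items():
        score = experimental_score(query_text, doc_text)
        if score > 0:
            yield query_id, (doc_id, score)


def reducer(query_id, doc_id_score_pairs, k=TOP_K):
    """Keep the k highest-scoring documents for one query."""
    top = heapq.nlargest(k, doc_id_score_pairs, key=lambda pair: pair[1])
    for doc_id, score in top:
        yield query_id, (doc_id, score)


def run_job(partitions, k=TOP_K):
    """Simulate map, local combine, shuffle, and reduce in memory."""
    shuffled = defaultdict(list)
    for partition in partitions:  # one partition stands in for one machine
        local = defaultdict(list)
        for doc_id, doc_text in partition:
            for query_id, pair in mapper(doc_id, doc_text):
                local[query_id].append(pair)
        # The reducer doubles as combiner: at most k pairs per query leave here.
        for query_id, pairs in local.items():
            for _, pair in reducer(query_id, pairs, k):
                shuffled[query_id].append(pair)
    return {query_id: [pair for _, pair in reducer(query_id, pairs, k)]
            for query_id, pairs in shuffled.items()}


# Hypothetical toy collection split over two "machines".
partitions = [
    [("d1", "map reduce map"), ("d2", "anchor text of a page")],
    [("d3", "reduce"), ("d4", "unrelated words")],
]
print(run_job(partitions, k=2))
```

Because the reducer emits the same kind of key/value pairs it consumes, it can be reused unchanged as the combiner, which is what keeps the amount of data sent after the map phase small.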

3 Case Study: ClueWeb09 

The ClueWeb09 test collection consists of 1 billion web pages in ten languages,
collected in January and February 2009. The dataset is used by several tracks 
of the TREC conference [10]. We used the English pages from the collection, 
about 0.5 billion pages equalling 12.5 TB (2.5 TB compressed). The size of 
the ClueWeb09 collection cannot be handled by a single machine, unless one is
willing to buy special hardware. We ran our experiments on a small cluster of 15 
machines; each machine costs about € 1000. The cluster runs Hadoop version
0.19.2 out of the box [11]. 

3.1 Time to code the experiment 

After gaining some experience with Hadoop by having M.Sc. students doing 
practical assignments, we wrote the code for sequential search and for anchor
text extraction in less than a day. Table 1 gives some idea of the size of the 
source code compared to that of experimental search systems. Note that this 
is by no means a fair comparison: The existing systems are general purpose 
information retrieval systems including a lot of functionality, whereas the linear 
search system only knows a single trick. Still, in order to adapt the systems 
below, one at least has to figure out what code to adapt. 



Code base                   #files      #lines   size (kb)
MapReduce anchors search         2         350          13
Terrier 2.2.1                  300      59,000       2,000
MonetDB/PF/Tijah 0.32.2        920   1,393,000      40,600
Lemur/Indri 4.11             1,210     540,000      19,500



Table 1: Size of code base per system 
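
To give an impression of why the anchor text extraction mentioned above fits in little code, here is a minimal in-memory sketch of such a job: the mapper emits (target URL, anchor text) pairs for every link on a page, and the reducer concatenates all anchor texts that point to the same target. This is an assumed formulation, not the code counted in Table 1; the class `AnchorParser`, the functions `anchor_mapper`, `anchor_reducer`, and `extract_anchors`, and the example pages are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorParser(HTMLParser):
    """Collects (target URL, anchor text) pairs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the URL of the source page.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())  # normalize whitespace
            if text:
                self.links.append((self._href, text))
            self._href = None


def anchor_mapper(page_url, html):
    """Emit one (target URL, anchor text) pair per link on the page."""
    parser = AnchorParser(page_url)
    parser.feed(html)
    parser.close()
    yield from parser.links


def anchor_reducer(target_url, anchor_texts):
    """Concatenate all anchor texts pointing to the same target."""
    yield target_url, " ".join(anchor_texts)


def extract_anchors(pages):
    """Run the map and reduce steps in memory over a {url: html} dictionary."""
    grouped = defaultdict(list)
    for page_url, html in pages.items():
        for target_url, text in anchor_mapper(page_url, html):
            grouped[target_url].append(text)
    return dict(pair for url, texts in grouped.items()
                for pair in anchor_reducer(url, texts))


# Hypothetical example pages.
pages = {
    "http://a.example/": '<a href="/x">First link</a> <a href="http://b.example/">B site</a>',
    "http://b.example/": '<a href="http://a.example/x">second  link</a>',
}
print(extract_anchors(pages))
```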



3.2 Time to run the experiment 

Anchor text extraction on all English documents of ClueWeb09 takes about 11
hours on our cluster. The anchor text representation contains text for about
87 % of the documents, about 400 GB in total. A subsequent TREC run using
50 queries on the anchor text representation takes less than 30 minutes. Our 
linear search system implements a fairly simple language model with a length 
prior without stemming or stop words. It achieves expected precision at 5, 10 
and 20 documents retrieved of respectively 0.42, 0.39, and 0.35 (MTC method), 
similar to the best runs at TREC 2009 [1]. 
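
One possible instantiation of the experimental_score function of Figure 1 is sketched below: a query likelihood language model with linearly interpolated (Jelinek-Mercer) smoothing and a prior proportional to document length, without stemming or stop word removal. The exact model used in the experiments is not specified here, so the smoothing weight `LAMBDA`, the precomputed `collection_prob` table, and the function name `lm_score` are assumptions made for illustration.

```python
import math

LAMBDA = 0.15  # hypothetical weight of the document model in the mixture


def lm_score(query_terms, doc_terms, collection_prob, lam=LAMBDA):
    """Smoothed query likelihood with a document length prior (sketch).

    Each query term contributes
        log(1 + (lam * tf / |D|) / ((1 - lam) * P(term | collection))),
    which ranks documents in the same order as the smoothed query likelihood
    but is zero for terms that do not occur in the document. Documents that
    match no query term get score 0, so the mapper does not output them.
    """
    doc_len = len(doc_terms)
    if doc_len == 0:
        return 0.0
    match = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        p_coll = collection_prob.get(term)
        if tf > 0 and p_coll:
            match += math.log(1.0 + (lam * tf / doc_len) / ((1.0 - lam) * p_coll))
    if match == 0.0:
        return 0.0
    # Length prior: P(D) proportional to |D|, added in log space.
    return match + math.log(doc_len)


# Hypothetical collection statistics for the query terms only.
collection_prob = {"web": 0.01, "search": 0.002}
print(lm_score(["web", "search"], ["web", "search", "engine"], collection_prob))
```

Only the collection probabilities of the query terms are needed, so in a sequential scan they can be supplied as a small table that is fixed per experiment, just like the queries themselves.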

Figure 2 shows how the system scales when processing up to 5,000 queries, 
using random sets of queries from the TREC 2009 Million Query track. Re- 
ported times are full Hadoop job times including job setup and job cleanup av- 
eraged over three trials. Processing time increases only slightly if more queries 
are processed. Whereas the average processing time is about 35 seconds
per query for 50 queries, it goes down to only 1.6 seconds per query for
5,000 queries. For comparison, the graph shows the performance of
"Lemur-one-node", i.e., Lemur version 4.11 running on one fourteenth of the
anchor text representation on a single machine. A distributed version of Lemur
searching the full anchor text representation would not be faster: It would be
only as fast as its slowest node, and it would additionally need to send results
from each node to the master and merge them. Lemur-one-node takes 3.3 seconds
per query on average for 50 queries, and 0.44 seconds on average for 5,000
queries. The processing times for Lemur were measured after flushing the file
system cache. Although Lemur cannot process queries in parallel, the system's
performance benefits considerably from receiving many queries. Lemur's
performance scales sublinearly because it caches intermediate results. Still, at
5,000 queries Lemur-one-node is only 3.6 times faster than the MapReduce system.
For experiments at this scale, the benefits of the full, distributed Lemur are
probably negligible.

Figure 2: Processing time for query set sizes
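
The per-query averages quoted above can be turned into total job times and speed ratios with simple arithmetic. The sketch below uses only the rounded figures from the text, so its results are approximate; the names `per_query`, `total_minutes`, and `slowdown` are illustrative.

```python
# Average seconds per query as quoted in the text (rounded figures).
per_query = {
    "MapReduce": {50: 35.0, 5000: 1.6},
    "Lemur-one-node": {50: 3.3, 5000: 0.44},
}


def total_minutes(system, n_queries):
    """Total processing time in minutes for a query set of the given size."""
    return per_query[system][n_queries] * n_queries / 60.0


def slowdown(n_queries):
    """How many times slower MapReduce is than Lemur-one-node."""
    return per_query["MapReduce"][n_queries] / per_query["Lemur-one-node"][n_queries]


for n in (50, 5000):
    print(n,
          round(total_minutes("MapReduce", n)),
          round(total_minutes("Lemur-one-node", n)),
          round(slowdown(n), 1))
```

For 50 queries the MapReduce total comes to roughly 29 minutes, in line with the run of less than 30 minutes reported above, and at 5,000 queries the ratio between the two systems is about 3.6.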

3.3 Related work 

The idea to use sequential scanning of documents to research new retrieval ap- 
proaches is certainly not new: We know of at least one researcher who used 
sequential scanning over ten years ago for his thesis [7]. Without high-level 
programming paradigms like MapReduce, however, efficiently implementing se- 
quential scanning is not a trivial task, and without a cluster of machines the 
approach does not scale to large collections.

Lin [6] used Hadoop MapReduce for computing pairwise document similari- 
ties. Our implementation resembles Lin's brute force algorithm that also scans 
document representations linearly. Our approach is simpler because our
preprocessing step does not divide the collection into blocks, nor does it compute
document vectors. 



4 Conclusion 

A faster turnaround of the experimental cycle can be achieved by making cod- 
ing of experimental systems easier. Faster coding means one is able to do more 
experiments, and more experiments mean more improvement of retrieval
performance. We implemented a full experimental retrieval system with little effort
using Hadoop MapReduce. Using 15 machines to search a web crawl of 0.5 bil- 
lion pages, the proposed MapReduce approach is less than 10 times slower than 
a single node of a distributed inverted index search system on a set of 50 queries. 
If more queries are processed per experiment, the processing times of the two 
systems get even closer. The code used in our experiment is open source
and available to other researchers at: http://mirex.sourceforge.net 